App review analysis¶

  • This is my attempt to visualise what people are saying about an app in the Google Play Store
  • Can the reviews be quantified?
  • Are we able to gauge opinions?
  • Can we visualise items or topics of interest in reviews?

Basic cleaning, merging and concatenating¶

In [1]:
!pip install scipy==1.10.1
Collecting scipy==1.10.1
  Using cached scipy-1.10.1-cp311-cp311-macosx_10_9_x86_64.whl (35.0 MB)
Collecting numpy<1.27.0,>=1.19.5
  Using cached numpy-1.26.4-cp311-cp311-macosx_10_9_x86_64.whl (20.6 MB)
Installing collected packages: numpy, scipy
Successfully installed numpy-1.26.4 scipy-1.10.1

In [8]:
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import seaborn as sns
from langdetect import detect, LangDetectException
import string
In [34]:
df_gb = pd.read_csv('app-reviews_en-gb.csv', sep='|')
df_us = pd.read_csv('app-reviews_en-us.csv', sep='|')
In [73]:
# Check for duplicates within each df
duplicates_in_df_gb = df_gb[df_gb.duplicated(subset=['reviewId', 'userName', 'appVersion'], keep=False)]
duplicates_in_df_us = df_us[df_us.duplicated(subset=['reviewId', 'userName', 'appVersion'], keep=False)]

# Check for duplicates between the dfs
merged_df = pd.merge(df_gb, df_us, on='reviewId', how='inner')

# Check for duplicate 'reviewId' values in the merged df
duplicates_between_dfs = merged_df[merged_df.duplicated(subset='reviewId', keep=False)]

print("Duplicates within df_gb:")
print(duplicates_in_df_gb)
print("-"*70)
print("Duplicates within df_us:")
print(duplicates_in_df_us)
print("-"*70)
print("Duplicates between the DataFrames:")
print(duplicates_between_dfs)
Duplicates within df_gb:
Empty DataFrame
Columns: [reviewId, userName, userImage, content, score, thumbsUpCount, reviewCreatedVersion, at, replyContent, repliedAt, appVersion]
Index: []
----------------------------------------------------------------------
Duplicates within df_us:
Empty DataFrame
Columns: [reviewId, userName, userImage, content, score, thumbsUpCount, reviewCreatedVersion, at, replyContent, repliedAt, appVersion]
Index: []
----------------------------------------------------------------------
Duplicates between the DataFrames:
Empty DataFrame
Columns: [reviewId, userName_x, userImage_x, content_x, score_x, thumbsUpCount_x, reviewCreatedVersion_x, at_x, replyContent_x, repliedAt_x, appVersion_x, userName_y, userImage_y, content_y, score_y, thumbsUpCount_y, reviewCreatedVersion_y, at_y, replyContent_y, repliedAt_y, appVersion_y]
Index: []

[0 rows x 21 columns]
In [35]:
result_df = pd.concat([df_gb, df_us], ignore_index=True)

duplicates_in_result = result_df[result_df.duplicated(subset=['reviewId', 'userName', 'appVersion'], keep=False)]
print("Duplicates within result:", len(duplicates_in_result))
Duplicates within result: 59368

Quite a few duplicates¶

  • It looks like the script used to scrape reviews from the Google Play Store wasn't perfect
  • Or there is some bleed-through between the UK and US Google Play Stores

Never mind though; as this is a test, we still have more than 30,000 records after removing duplicates 👇¶

In [36]:
result_df_no_duplicates = result_df.drop_duplicates(subset=['reviewId', 'userName', 'appVersion'], keep='first')

print("Shape of df after removing duplicates:", result_df_no_duplicates.shape)
Shape of df after removing duplicates: (30316, 11)

Sentiment analysis¶

How does the rating of an app affect people's sentiment when writing a review?¶

Using TextBlob we can look at basic natural language while also:¶
  • Calculating a sentiment score
    • -1.0 to 1.0 (Negative - Positive) with 0.0 being Neutral
  • We can group sentiment scores by rating and see how people talk about an app based on the review
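As a toy illustration of the grouping step (the scores and sentiment values below are made up, not the review data):

```python
import pandas as pd

# Hypothetical toy data: two reviews each at the extremes of the score range
toy = pd.DataFrame({
    "score": [1, 1, 5, 5],
    "sentiment": [-0.2, 0.1, 0.6, 0.4],
})

# Average sentiment per score band, as in the cell below
avg = toy.groupby("score")["sentiment"].mean()
```

The same `groupby('score')['sentiment'].mean()` pattern is applied to the full review set in the next cell.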
In [37]:
from textblob import TextBlob

df = result_df_no_duplicates.copy()  # copy to avoid SettingWithCopyWarning when adding columns

# Function to calculate sentiment polarity
def calculate_sentiment(text):
    return TextBlob(text).sentiment.polarity

# Calculate sentiment polarity for each review
df['sentiment'] = df['content'].dropna().apply(calculate_sentiment)

# Group by rating and calculate average sentiment
avg_sentiment_by_rating = df.groupby('score')['sentiment'].mean().reset_index()

avg_sentiment_by_rating
Out[37]:
score sentiment
0 1 -0.043658
1 2 0.028367
2 3 0.082904
3 4 0.305454
4 5 0.485364
In [16]:
plt.figure(figsize=(10, 5))
plt.bar(avg_sentiment_by_rating['score'], avg_sentiment_by_rating['sentiment'], color='pink')
plt.xlabel('Score')
plt.ylabel('Average Sentiment')
plt.title('Average Sentiment by Score')
plt.xticks(avg_sentiment_by_rating['score'])
plt.grid(axis='y')

plt.show()
No description has been provided for this image

Some findings¶

  • On average, reviewers do not take a completely negative tone even when giving the app a bad score (1)
  • Every score from 2 and above carries some degree of positive sentiment
  • Reviews with a score of 5 (sentiment: 0.49) are moderately positive on average

There seems to be a clear correlation between review sentiment and app score¶

What's the spread of scoring like?¶

In [61]:
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='score', bins=[1, 2, 3, 4, 5, 6], kde=False, discrete=True)
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Histogram of Ratings (1-5)')

plt.show()
No description has been provided for this image

Polar scoring¶

  • People either love it or hate it
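One rough way to quantify this polarisation is the share of extreme (1- or 5-star) ratings; a sketch on made-up scores:

```python
import pandas as pd

# Hypothetical ratings, not the real review data
scores = pd.Series([1, 1, 5, 5, 5, 3, 2, 4, 1, 5])

# Fraction of reviews sitting at either extreme of the scale
extreme_share = scores.isin([1, 5]).mean()
```

Applied to the real `df['score']`, a high `extreme_share` would confirm the love-it-or-hate-it pattern visible in the histogram.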

Does sentiment and score correlate generally over time?¶

In [71]:
non_nan_count = df['content'].notna().sum()
print("Number of non-NaN rows in 'content' column:", non_nan_count)
Number of non-NaN rows in 'content' column: 30316
In [77]:
tdf = df.copy()

tdf['at'] = pd.to_datetime(tdf['at'], format='%Y-%m-%d %H:%M:%S')
tdf['content'] = tdf['content'].fillna("")
tdf['sentiment'] = tdf['content'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)
tdf[['rating', 'sentiment']] = tdf[['score', 'sentiment']].apply(pd.to_numeric, errors='coerce')

tdf.set_index('at', inplace=True)

# Resample by day and calculate mean for 'rating' and 'sentiment'
tdf_resampled = tdf[['rating', 'sentiment']].resample('D').mean().reset_index()

# Calculate EMAs for both 'rating' and 'sentiment'
ema_span = 30.417  # Approximately one month
tdf_resampled['ema_rating'] = tdf_resampled['rating'].ewm(span=ema_span).mean()
tdf_resampled['ema_sentiment'] = tdf_resampled['sentiment'].ewm(span=ema_span).mean()

plt.figure(figsize=(15, 7))

# Plot EMA of rating
ax1 = plt.gca()  # Get current axis
ax2 = ax1.twinx()  # Create another axis that shares the same x-axis

ax1.plot(tdf_resampled['at'], tdf_resampled['ema_rating'], label='EMA Rating', color='blue')
ax1.set_xlabel('Review Date')
ax1.set_ylabel('Average Rating', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Plot EMA of sentiment on the secondary y-axis
ax2.plot(tdf_resampled['at'], tdf_resampled['ema_sentiment'], label='EMA Sentiment', color='green')
ax2.set_ylabel('Average Sentiment', color='green')
ax2.tick_params(axis='y', labelcolor='green')

plt.title(f'Average Rating and Sentiment Over Time with EMA (Span = {ema_span} days)')
plt.grid(True)
plt.tight_layout()

# Add legends
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

plt.show()
No description has been provided for this image
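For reference, pandas' `ewm(span=s)` is equivalent to an exponential weighting with `alpha = 2 / (s + 1)`, which is why a span of roughly 30.4 approximates monthly smoothing over daily samples. A quick check on toy data:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])  # toy daily values

span = 9.0
alpha = 2 / (span + 1)  # pandas maps span to this smoothing factor

# The two parameterisations produce identical EMAs
a = s.ewm(span=span).mean()
b = s.ewm(alpha=alpha).mean()
```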

Sentiment and ratings/score are very closely correlated over time¶

Note that there are far fewer reviews before 2019

In [81]:
# Check the number of reviews per year
tdf.reset_index(inplace=True)
year_count = tdf['at'].dt.year.value_counts().sort_index()

year_count
Out[81]:
at
2018      55
2019    3584
2020    5267
2021    9044
2022    8354
2023    4012
Name: count, dtype: int64

Thumbs up??¶

  • People can give reviews a 'thumbs up'
  • We will assume that a 'thumbs up' 👍 signals agreement with the review
In [83]:
# Calculate the average thumbsUpCount for each rating
average_thumbs_up = df.groupby('score')['thumbsUpCount'].mean()

# Plot the average thumbsUpCount for each rating
plt.figure(figsize=(10, 6))
plt.bar(average_thumbs_up.index, average_thumbs_up.values, color='blue')
plt.xlabel('Rating')
plt.ylabel('Average Thumbs Up Count')
plt.title('Average Thumbs Up Count per Rating')
plt.xticks(average_thumbs_up.index)
plt.show()
No description has been provided for this image

People tend to agree with a review when it is less than positive¶

We'll call a score of 2 the 'elbow point'¶

  • Correlating this with sentiment:
    • Score 1 (-0.043658): reviews with the lowest rating carry a slightly negative average sentiment, as expected from critical or unsatisfied reviewers
    • Score 2 (0.028367): slightly positive sentiment on average, indicating that reviews with a rating of 2 still mix criticism with faint praise or less intense negativity

What topics matter most to reviewers of the app?¶

To do this, we employ natural language processing and Latent Dirichlet Allocation (LDA) to explain why certain parts of the data are similar¶
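The tokenise-and-filter step used below can be sketched in plain Python (a simplified stand-in for NLTK's tokeniser; the stop-word and exclusion sets here are toy examples, not the full NLTK lists):

```python
import re

# Toy stop-word and exclusion sets for illustration only
stop_words = {"the", "is", "and", "a", "to"}
exclude_words = {"app", "klarna"}

def clean_tokens(review: str) -> list[str]:
    # Keep lowercase alphabetic tokens only, mirroring word.isalpha() below
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in stop_words and t not in exclude_words]

clean_tokens("The app is easy to use and Klarna is great!")  # → ['easy', 'use', 'great']
```

The real pipeline additionally filters to English reviews with `langdetect` before tokenising.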

In [28]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis.gensim_models

# download stop words for multiple languages
ldf = df.copy()

nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words(['english', 'german', 'spanish', 'swedish']))

# additional words to exclude
exclude_words = {'ca', 'app', 'klarna'}

# Tokenize and clean text
tokenized_data = []
for review in ldf['content']:
    if not isinstance(review, str):
        continue
    try:
        if detect(review) == 'en': # detect the language of the review
            tokens = nltk.word_tokenize(review.lower())
            tokens = [word for word in tokens if word.isalpha() and word not in stop_words and word not in exclude_words]
            tokenized_data.append(tokens)
    except LangDetectException:
        continue  # skip the review if language can't be detected

# create a dictionary and corpus
dictionary = corpora.Dictionary(tokenized_data)
corpus = [dictionary.doc2bow(text) for text in tokenized_data]

# generate the LDA model with num_topics topics
lda_model = LdaModel(corpus, num_topics=7, id2word=dictionary, passes=15, random_state=1)

# visualize topics
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dominiclove/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[28]:

Looking at our principal components, we can clearly see 3 groups:¶

  • Group 1¶

    • Topics 1, 3, 4, 5
  • Group 2¶

    • Topics 2, 6
  • Group 3¶

    • Topic 7
In [29]:
# Initialize topic proportion counters
topic_counter = np.zeros(7)  # num_topics

# Go through the corpus and get the topic distribution for each document
for doc in corpus:
    topic_distribution = lda_model.get_document_topics(doc)
    for topic, proportion in topic_distribution:
        topic_counter[topic] += proportion

# Normalize the counts to get proportions
topic_proportions = topic_counter / topic_counter.sum()

plt.figure(figsize=(12, 6))
plt.bar(range(1, len(topic_proportions) + 1), topic_proportions)
plt.xlabel('Topic Number')
plt.ylabel('Proportion')
plt.title('Scree Plot of Topic Proportions')
plt.show()
No description has been provided for this image
In [76]:
from langdetect import detect, LangDetectException

# Initialize stop words
stop_words = set(nltk.corpus.stopwords.words('english'))

# Additional words to exclude
exclude_words = {'ca', 'app', 'klarna'}

# Tokenize and clean text
tokenized_data = []
for review in df['content']:
    if not isinstance(review, str):
        continue
    try:
        # Detect the language of the review
        if detect(review) == 'en':
            tokens = nltk.word_tokenize(review.lower())
            tokens = [word for word in tokens if word.isalpha() and word not in stop_words and word not in exclude_words]
            tokenized_data.append(tokens)
    except LangDetectException:
        continue  # Skip the review if language can't be detected

# Create a dictionary and corpus
dictionary = corpora.Dictionary(tokenized_data)
corpus = [dictionary.doc2bow(text) for text in tokenized_data]

# Generate the LDA model
lda_model = LdaModel(corpus, num_topics=7, id2word=dictionary, passes=15)

# Visualize topics
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
Out[76]:
In [77]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis.gensim_models
import nltk

# Download stop words
nltk.download('punkt')
stop_words = set(stopwords.words(['english', 'german', 'spanish', 'swedish']))

# Additional words to exclude
exclude_words = {'ca', 'app', 'klarna'}

# Tokenize and clean text
tokenized_data = []
for review in df['content'].dropna():
    tokens = nltk.word_tokenize(review.lower())
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words and word not in exclude_words]
    tokenized_data.append(tokens)

# Create a dictionary and corpus
dictionary = corpora.Dictionary(tokenized_data)
corpus = [dictionary.doc2bow(text) for text in tokenized_data]

# Generate the LDA model
lda_model = LdaModel(corpus, num_topics=32, id2word=dictionary, passes=15)

# Visualize topics
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dominiclove/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[77]:
In [30]:
from gensim.models.coherencemodel import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_data, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print(f'Coherence Score: {coherence_lda}')
Coherence Score: 0.5973972930487592

The below function allows us to determine the best number of topics¶

  • By calculating a coherence score, we can tweak the model to get the best results
  • It's an incredibly slow process, so do not run it more than once 😅
In [27]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

# Function call
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=tokenized_data, start=2, limit=40, step=4)

# Plotting
limit=40; start=2; step=4;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values"), loc='best')

plt.minorticks_on()
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='gray')

plt.show()
No description has been provided for this image

How coherent is this data?¶

Does the language make sense, and can meaning be identified?¶

  • Yes; with a score of 0.597, the coherence is usable
  • A coherence score of 0.59 with 7 topics represents a solid foundation for extracting meaning from the data in a usable fashion

Generate key words for each topic¶

In [32]:
# Show top N keywords for each topic
top_topics = lda_model.show_topics(num_topics=7, num_words=20, formatted=False)

for i, topic in enumerate(top_topics):
    print(f"Topic {i+1}:")
    print(", ".join([word[0] for word in topic[1]]))
Topic 1:
get, phone, update, even, email, account, number, trying, keeps, let, open, try, work, working, log, tried, use, please, cant, go
Topic 2:
love, easy, use, pay, great, payments, way, awesome, buy, need, get, want, things, really, convenient, helpful, payment, later, helps, shop
Topic 3:
service, customer, order, money, account, never, company, bank, get, still, take, help, back, got, would, wait, days, took, refund, support
Topic 4:
payment, purchase, pay, time, payments, used, make, paid, never, using, purchases, first, made, power, due, always, even, use, amount, one
Topic 5:
card, credit, use, time, one, approved, get, let, ghost, even, would, cards, better, waste, give, purchase, like, could, buy, afterpay
Topic 6:
great, best, love, service, shopping, experience, ever, thank, absolutely, shop, excellent, much, thing, thanks, way, amazing, christmas, company, easier, life
Topic 7:
good, works, really, slow, like, far, work, cool, pretty, problems, needs, well, everything, sometimes, load, issues, crashing, see, website, ok

Generating more words for topics to pass to a LLM to identify themes¶

Topics and Their Themes¶

Topic 1: Technical Issues and Usability Problems

  • Keywords: get, phone, update, email, account, number, trying, keeps, let, open, try, work, working, log, tried, use, please, cant, go
  • Interpretation: This topic likely covers reviews discussing difficulties with using the app, such as problems with logging in, the app not working as expected, issues after updates, or trouble with account access.

Topic 2: Positive Experience with App Functionality

  • Keywords: love, easy, use, pay, great, payments, way, awesome, buy, need, get, want, things, really, convenient, helpful, payment, later, helps, shop
  • Interpretation: Reviews in this topic seem to express satisfaction with the app’s ease of use, convenience for making payments, and the ability to buy now and pay later. Users appreciate the app's functionality that makes shopping easier and more flexible.

Topic 3: Customer Service and Order Issues

  • Keywords: service, customer, order, money, account, never, company, bank, get, still, take, help, back, got, would, wait, days, took, refund, support
  • Interpretation: This topic focuses on experiences with customer service, issues with orders, and financial transactions. It includes mentions of delays, problems with refunds, and dissatisfaction with how the company handles customer support.

Topic 4: Payment Process and Terms

  • Keywords: payment, purchase, pay, time, payments, used, make, paid, never, using, purchases, first, made, power, due, always, even, use, amount, one
  • Interpretation: Reviews categorized under this topic are likely about the payment process, the experience of making payments, specific purchases, and the terms of payment. This could also touch upon the reliability and timing of payments.

Topic 5: Credit and Approval Issues

  • Keywords: card, credit, use, time, one, approved, get, let, ghost, even, would, cards, better, waste, give, purchase, like, could, buy, afterpay
  • Interpretation: This topic seems to revolve around credit issues, such as problems with credit cards, approval for use, and comparisons with other services like Afterpay. The mention of "ghost" could refer to ghost cards or temporary digital credit card numbers.

Topic 6: General Praise and Shopping Experience

  • Keywords: great, best, love, service, shopping, experience, ever, thank, absolutely, shop, excellent, much, thing, thanks, way, amazing, christmas, company, easier, life
  • Interpretation: Reviews in this topic express overall praise for the app, highlighting an excellent shopping experience, gratitude towards the service, and how the app has made their life easier, especially around the holiday season like Christmas.

Topic 7: App Functionality and Performance

  • Keywords: good, works, really, slow, like, far, work, cool, pretty, problems, needs, well, everything, sometimes, load, issues, crashing, see, website, ok
  • Interpretation: This topic discusses the app's functionality and performance, with a mix of positive feedback and criticisms related to app speed, reliability, and occasional problems like crashes or slow load times.

Assign a topic to each review (English language)¶

This can now be given to interested parties to match topics to each review for further analysis or insights¶
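The notebook doesn't show the assignment step itself, but one way to do it is to take the dominant topic from `lda_model.get_document_topics(bow)` for each review's bag-of-words; the selection is just an argmax over (topic, probability) pairs. The distribution below is made up for illustration:

```python
# Pick the topic with the highest probability from a gensim-style
# (topic_id, probability) list. The input here is a fabricated example,
# not real model output.
def dominant_topic(topic_distribution):
    topic, prob = max(topic_distribution, key=lambda tp: tp[1])
    return topic

dominant_topic([(0, 0.1), (3, 0.7), (6, 0.2)])  # → 3
```

In practice this would be applied per review, e.g. `[dominant_topic(lda_model.get_document_topics(bow)) for bow in corpus]`, and stored as a new column.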

Generate a word cloud for each rating, which allows for quick and easy identification of related words in each rating¶
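The merging trick used in the cell below relies on `Counter` addition, which sums frequencies key-wise before they are fed to `generate_from_frequencies`. A minimal example with made-up counts:

```python
from collections import Counter

# Hypothetical unigram and bigram frequencies
word_freq = Counter({"easy": 3, "pay": 2})
bigram_freq = Counter({"easy pay": 1, "pay": 1})

# Counter addition sums counts for shared keys and keeps the rest
merged = word_freq + bigram_freq
```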

In [30]:
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
import nltk

# Download required resources for NLTK
nltk.download('punkt')
nltk.download('stopwords')


# Define custom words to exclude
custom_exclude = ['app', 'klarna', 'ca', 'appen']

# Create a custom colormap
colors = ["#0000FF", "#FF69B4"]  # Blue to Pink
cmap = LinearSegmentedColormap.from_list("custom", colors, N=256)

# Function to generate word cloud
def generate_word_cloud(text):
    wordcloud = WordCloud(width=800, height=400, background_color='white', colormap=cmap).generate_from_frequencies(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

# Define stop words
stop_words = set(stopwords.words('english'))
stop_words.update(stopwords.words('swedish'))
stop_words.update(stopwords.words('german'))
stop_words.update(stopwords.words('norwegian'))
stop_words.update(stopwords.words('spanish'))

# Group the DataFrame by 'score' and combine the 'content' text for each group
grouped_reviews = df.groupby('score')['content'].apply(lambda x: ' '.join(x.dropna().astype(str))).reset_index()


# Generate word clouds for each rating group
for _, row in grouped_reviews.iterrows():
    print(f"Word Cloud for Rating {row['score']}")
    # Tokenize the text and remove stopwords and custom words
    tokens = [word for word in word_tokenize(row['content'].lower()) if word not in stop_words and word not in custom_exclude and word.isalpha()]
    
    # Count the frequency of each word (unigram)
    word_freq = Counter(tokens)
    
    # Create bigrams
    bigrams = list(ngrams(tokens, 2))
    
    # Count the frequency of each bigram
    bigram_freq = Counter(map(lambda x: ' '.join(x), bigrams))
    
    # Merge unigram and bigram frequencies for a more complete word cloud
    merged_freq = word_freq + bigram_freq
    
    # Generate the word cloud
    generate_word_cloud(merged_freq)
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dominiclove/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dominiclove/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Word Cloud for Rating 1
No description has been provided for this image
Word Cloud for Rating 2
No description has been provided for this image
Word Cloud for Rating 3
No description has been provided for this image
Word Cloud for Rating 4
No description has been provided for this image
Word Cloud for Rating 5
No description has been provided for this image

Take away notes¶

  1. Most Reviews are Positive - though this is a small dataset of roughly 30,000 reviews from the Google Play Store (Android)
  2. People Love to Hate - although low-scoring reviews are comparatively few, they attract more interaction and 'likes' than positive reviews
  3. Positive Reviews - users favour strong language, with words like 'love' and 'great'
  4. Correlation between multiple themes - strong themes include 'trust', 'usability', 'the experience' and 'help/service'

To note¶

  1. This is a holistic analysis of Android users - for better insights, analysis could focus on a particular time of year or app version
  2. iPhone - does the experience change according to mobile OS?
  3. Separate app usage from product - for in-depth analysis of either the product or the app, NLP can be exploited further to drill into different analyses

Wordcloud per score/rating¶

In [39]:
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
import nltk
import random  # Import the random module

# Set a random seed for reproducibility
random.seed(42)  # You can choose any integer as the seed

# Download required resources for NLTK
nltk.download('punkt')
nltk.download('stopwords')

ddf = df.copy()

# Define custom words to exclude
custom_exclude = ['app', 'klarna', 'ca', 'appen']

# Create a custom colormap
colors = ["#0000FF", "#FF69B4"]
cmap = LinearSegmentedColormap.from_list("custom", colors, N=256)

# Function to generate word cloud
def generate_word_cloud(text):
    wordcloud = WordCloud(width=800, height=400, background_color='white', colormap=cmap).generate_from_frequencies(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

# Define stop words
stop_words = set(stopwords.words('english'))
#stop_words.update(stopwords.words('swedish'))
#stop_words.update(stopwords.words('german'))
#stop_words.update(stopwords.words('norwegian'))
#stop_words.update(stopwords.words('spanish'))

# Group the DataFrame by 'rating' and combine the 'review_description' text for each group
grouped_reviews = ddf.groupby('score')['content'].apply(lambda x: ' '.join(x.dropna().astype(str))).reset_index()

# Generate word clouds for each rating group
for _, row in grouped_reviews.iterrows():
    print(f"Word Cloud for Rating {row['score']}")
    # Tokenize the text and remove stopwords and custom words
    tokens = [word for word in word_tokenize(row['content'].lower()) if word not in stop_words and word not in custom_exclude and word.isalpha()]
    
    # Count the frequency of each word (unigram)
    word_freq = Counter(tokens)
    
    # Create bigrams
    bigrams = list(ngrams(tokens, 2))
    
    # Count the frequency of each bigram
    bigram_freq = Counter(map(lambda x: ' '.join(x), bigrams))
    
    # Merge unigram and bigram frequencies for a more complete word cloud
    merged_freq = word_freq + bigram_freq
    
    # Generate the word cloud
    generate_word_cloud(merged_freq)
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dominiclove/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dominiclove/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Word Cloud for Rating 1
No description has been provided for this image
Word Cloud for Rating 2
No description has been provided for this image
Word Cloud for Rating 3
No description has been provided for this image
Word Cloud for Rating 4
No description has been provided for this image
Word Cloud for Rating 5
No description has been provided for this image

Wordcloud per topic¶

In [44]:
# The below assumes the lda_model is already trained and available

# Create a custom colormap for your word clouds
colors = ["#0000FF", "#FF69B4"]  # Blue to Pink
cmap = LinearSegmentedColormap.from_list("custom", colors, N=256)

# Function to generate word cloud from word frequencies
def generate_word_cloud(word_freqs):
    wordcloud = WordCloud(width=800, height=400, background_color='white', colormap=cmap).generate_from_frequencies(word_freqs)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

# Show top N keywords for each topic, setting N to 75
top_topics = lda_model.show_topics(num_topics=7, num_words=75, formatted=False)

for topic_num, words in top_topics:
    print(f"Word Cloud for Topic {topic_num+1}")
    # Convert the list of (word, probability) tuples to a dictionary suitable for WordCloud
    word_freqs = {word: prob for word, prob in words}
    # Generate the word cloud
    generate_word_cloud(word_freqs)
Word Cloud for Topic 1
No description has been provided for this image
Word Cloud for Topic 2
No description has been provided for this image
Word Cloud for Topic 3
No description has been provided for this image
Word Cloud for Topic 4
No description has been provided for this image
Word Cloud for Topic 5
No description has been provided for this image
Word Cloud for Topic 6
No description has been provided for this image
Word Cloud for Topic 7
No description has been provided for this image